.. This file is part of lazr.uri. lazr.uri is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, version 3 of the License. lazr.uri is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with lazr.uri. If not, see . lazr.uri ******** The lazr.uri package includes code for parsing and dealing with URIs. >>> import lazr.uri >>> print 'VERSION:', lazr.uri.__version__ VERSION: ... ============= The URI class ============= >>> from lazr.uri import URI >>> uri1 = URI('http://localhost/foo/bar?123') >>> uri2 = URI('http://localhost/foo/bar/baz') >>> uri1.contains(uri2) True These next two are equivalent, so the answer should be True, even through the "outside" one is shorter than the "inside" one. >>> uri1 = URI('http://localhost/foo/bar/') >>> uri2 = URI('http://localhost/foo/bar') >>> uri1.contains(uri2) True The next two are exactly the same. We consider a url to be inside itself. >>> uri1 = URI('http://localhost/foo/bar/') >>> uri2 = URI('http://localhost/foo/bar/') >>> uri1.contains(uri2) True In the next case, the string of url2 starts with the string of url1. But, because url2 continues within the same path step, url2 is not inside url1. >>> uri1 = URI('http://localhost/foo/ba') >>> uri2 = URI('http://localhost/foo/bar') >>> uri1.contains(uri2) False Here, url2 is url1 plus an extra path step. So, url2 is inside url1. >>> uri1 = URI('http://localhost/foo/bar/') >>> uri2 = URI('http://localhost/foo/bar/baz') >>> uri1.contains(uri2) True Once the URI is parsed, its parts are accessible. >>> uri = URI('https://fish.tree:8666/blee/blah') >>> uri.scheme 'https' >>> uri.host 'fish.tree' >>> uri.port '8666' >>> uri.authority 'fish.tree:8666' >>> uri.path '/blee/blah' >>> uri = URI('https://localhost/blee/blah') >>> uri.scheme 'https' >>> uri.host 'localhost' >>> uri.port is None True >>> uri.authority 'localhost' >>> uri.path '/blee/blah' The grammar from RFC 3986 does not allow for square brackets in the query component, but Section 3.4 does say how such delimeter characters should be handled if found in the component. >>> uri = URI('http://www.apple.com/store?delivery=[slow]#horse+cart') >>> uri.scheme 'http' >>> uri.host 'www.apple.com' >>> uri.port is None True >>> uri.path '/store' >>> uri.query 'delivery=[slow]' >>> uri.fragment 'horse+cart' ==================== Finding URIs in Text ==================== lazr.uri also knows how to retrieve a list of URIs from a block of text. This is intended for uses like finding bug tracker URIs or similar. The find_uris_in_text() function returns an iterator that yields URI objects for each URI found in the text. Note that the returned URIs have been canonicalised by the URI class: >>> from lazr.uri import find_uris_in_text >>> text = ''' ... A list of URIs: ... * http://localhost/a/b ... * http://launchpad.net ... * MAILTO:joe@example.com ... * xmpp:fred@example.org ... * http://bazaar.launchpad.net/%7ename12/firefox/foo ... * http://somewhere.in/time?track=[02]#wasted-years ... ''' >>> for uri in find_uris_in_text(text): ... print uri http://localhost/a/b http://launchpad.net/ mailto:joe@example.com xmpp:fred@example.org http://bazaar.launchpad.net/~name12/firefox/foo http://somewhere.in/time?track=[02]#wasted-years =============== Other Documents =============== .. toctree:: :glob: * docs/*